Testing the accuracy of automated classification systems using only expert ratings that are less accurate than the system
ABSTRACT
A method is presented to estimate the accuracy of automated classification systems using only expert ratings that may be substantially less accurate than the systems being evaluated. The estimation method begins with multiple expert ratings on test cases, uses the level of inter-rater agreement to estimate rater accuracy, uses Bayesian updating based on estimated rater accuracy to estimate a "ground truth" probability for each classification, and then estimates system accuracy by comparing the relative frequency with which the system agrees with the most probable classification at different probability levels. A simulation analysis provides evidence that the method is robust and yields reasonable estimates of system accuracy under diverse and predictable conditions.

Acknowledgment: This research was supported by the Intelligence Advanced Research Projects Activity (IARPA). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Disclaimer: The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA or the U.S. Government.

©2014 The MITRE Corporation. All rights reserved. Approved for Public Release; Distribution Unlimited. 13-4068

INTRODUCTION

Information technology is advancing to develop systems that address problems of increasing sophistication and complexity. New systems are being developed to address complex problems as diverse as automated medical and clinical diagnoses, technology readiness evaluation, detection of emerging technologies, classification of the contents of unstructured video segments, recognition and classification of metaphors used in natural language text, and many others.¹

The complexities of the problems that these advanced systems address make it difficult to evaluate the accuracy of such systems. It is usually necessary to resort to using expert raters to establish the ground truth of test cases. However, the fact that these systems are addressing complex problems also presents a challenge to the expert raters. Expert raters often disagree as to the correct answer. Furthermore, as future systems address problems of ever increasing sophistication and complexity, it seems likely that the experts will be even more challenged and will exhibit even lower levels of agreement. Ground truth data sets based on expert judgment are fallible and are likely to become more so in the future.

Using expert raters to establish the ground truth of test cases is certainly not a new practice. For classification problems, which are the focus of this paper, it is a common scientific practice to measure the level of agreement among raters with a statistic such as Kappa, and to refine the rating process until a satisfactory level of agreement is reached. Once the agreement threshold is reached, the judgments of individual raters or collaborating teams of raters are treated as the ground truth (see Gwet, 2010 for a review). For several reasons, this common scientific practice does not adequately meet the needs of advanced system evaluation. First, the level of agreement among raters will rarely meet a satisfactory level. The problems that these systems address are simply too complex.
About the only way to increase the level of agreement is to select simple, non-representative test cases. Second, estimating system accuracy by measuring the level of agreement with expert raters makes the de facto assumption that the experts are more accurate than the system. This assumption runs contrary to a substantial body of empirical research, where it is very often found that simple algorithms outperform human experts in complex judgments (Dawes, 1979; Grove et al., 2001; Tetlock, 2005). It should not be presumed that the experts are more accurate than the system. Third, there is considerable evidence to suggest that for a wide variety of judgment tasks collaborative team judgments are not substantially more accurate than the judgments of a randomly selected individual team member (e.g., Surowiecki, 2005; Armstrong, 2006). In judgment tasks where there is no obvious correct answer, it should not be presumed that collaboration will reliably lead the raters to converge on the correct answer. Finally, when evaluating a classification system the statistic of greatest interest is the accuracy of the system: the proportion of system assignments that are correct. Unfortunately, the relationship between inter-rater reliability statistics such as Kappa, the probability of correct ground truth assignments, and the accuracy of any system tested against error-prone ground truth assignments is unclear.

This paper presents a different approach to using expert ratings to estimate the accuracy of complex systems. Rather than treat expert ratings as a surrogate for ground truth, expert ratings are treated as error-prone estimates of ground truth, where independent ratings are fused to generate estimated ground truth probabilities, and the ground truth probabilities are then used to estimate system accuracy.

This paper makes several strong claims for this approach to estimating the accuracy of classification systems. First, under diverse conditions, this approach reliably yields estimates of system accuracy that are approximately correct. If the system is 90% accurate, then this approach will yield an estimate of system accuracy that is close to 90%. Second, the accuracy of the estimate of system accuracy is largely independent of whether the expert raters are more or less accurate than the system. If the system is in fact 90% accurate, and the raters are individually 60% accurate, then the estimate of system accuracy will still be approximately 90%. Third, reliable estimates of system accuracy can often be obtained with reasonable sample sizes. Some of the simulation runs shown below use a sample size of just 50 test cases with three independent raters. In complex domains it is important to keep sample sizes as small as possible, since it may be time consuming and costly to obtain expert ratings. Fourth, and importantly, the conditions under which the above claims may break down are predictable. Therefore test data sets can be intentionally constructed to ensure that the conditions needed for accurate estimation of system accuracy are met.

¹ A quick look at the web sites for the Intelligence Advanced Research Projects Activity Office of Incisive Analysis (http://www.iarpa.gov/office_incisive.html), the Defense Advanced Research Projects Agency (http://www.darpa.mil/), the National Science Foundation (http://www.nsf.gov/), and the National Institutes of Health (www.nih.gov) will reveal many such development efforts.
The objective of this paper is to present evidence for the above four claims. This is achieved by describing one practical instantiation of this approach, along with sufficient test results to present clear evidence for the first three claims and some evidence for the fourth claim.

A METHOD FOR ESTIMATING THE ACCURACY OF SYSTEM CLASSIFICATIONS

The method for estimating accuracy used in this paper was derived from the following assumptions.

A1. For each case there is a unique correct classification.

A2. For each case raters independently assign classifications.

A3. Expected agreement between raters increases as expected rater accuracy increases.

Assumption A3 refers to expected agreement and accuracy. Here "accuracy" refers to the total proportion of correct classifications made by all the raters, irrespective of which raters are making correct and incorrect classifications, and "agreement" refers to the total proportion of pairwise agreement among all of the raters and cases. For any particular set of cases, accuracy may be low yet agreement high (the raters made the same mistakes), but A3 asserts that in general there is an expected positive relationship between accuracy and agreement.

Theorem 1: A1-A3 are ensured if and only if the raters behave as though their selection for each case is determined by a single confusion matrix in which the conditional probability of a correct assignment is constant and the conditional probability of every incorrect assignment is equal. That is to say, all raters on all problems are selecting from a single confusion matrix with a structure such as that shown in Table 1.

Table 1: Implied structure of rater confusion matrices for a four-category problem
(A through D are the true categories; "A" through "D" are the selected categories.)

       "A"         "B"         "C"         "D"
A      Pc          (1-Pc)/3    (1-Pc)/3    (1-Pc)/3
B      (1-Pc)/3    Pc          (1-Pc)/3    (1-Pc)/3
C      (1-Pc)/3    (1-Pc)/3    Pc          (1-Pc)/3
D      (1-Pc)/3    (1-Pc)/3    (1-Pc)/3    Pc

The proof of this theorem is found in the Appendix. The general structure of the proof shows that if the raters are assigning classifications using any process other than selecting from a common confusion matrix with the structure illustrated in Table 1, then it is always possible to construct a classification process with lower expected accuracy and higher expected agreement, or with higher accuracy and lower agreement, thereby violating the assumed monotonic relationship between expected accuracy and expected agreement.

A1 through A3 also seem to be implicitly assumed in many contexts where the Kappa statistic is applied. Indeed it is A3 that would seem to warrant the common practice of using expert ratings as surrogates for ground truth when high levels of inter-rater agreement are found. Consequently it is reasonable to claim that the estimation method described below is derived from assumptions implicit in the Kappa statistic and in how Kappa is often used. Because of this relationship to the Kappa statistic, in the remainder of this paper A1-A3 will be referred to as K-assumptions. Furthermore, the properties of equal rater accuracy, equal error probabilities, and equal problem difficulty that are implied by the K-assumptions will be referred to as K-properties.
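To make the K-properties concrete, the following minimal sketch (hypothetical Python, not from the paper; the function name and parameters are illustrative) simulates a rater who selects classifications from a confusion matrix with the structure shown in Table 1: the true category is selected with probability Pc, and each incorrect category with probability (1-Pc)/(N-1).

```python
import random

def rate(true_category, categories, pc):
    """Simulate one rater under the K-properties (Table 1): the rater
    selects the true category with probability pc and otherwise selects
    uniformly among the N - 1 incorrect categories."""
    if random.random() < pc:
        return true_category
    wrong = [c for c in categories if c != true_category]
    return random.choice(wrong)

# Example: a four-category problem, three raters of individual accuracy 0.5
categories = ["A", "B", "C", "D"]
print([rate("B", categories, pc=0.5) for _ in range(3)])
```

Simulations of this kind, with the true categories known by construction, are the basis for checking how well the estimation method recovers rater and system accuracy.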
Table 2: Sample data of expert ratings and system assignments for 10 test cases

Case #   Rater 1   Rater 2   Rater 3   Rater 4   System
  1        "C"       "D"       "C"       "C"       "A"
  2        "B"       "D"       "C"       "C"       "C"
  3        "C"       "C"       "D"       "C"       "C"
  4        "B"       "B"       "D"       "D"       "B"
  5        "A"       "B"       "B"       "B"       "B"
  6        "C"       "B"       "D"       "A"       "A"
  7        "A"       "A"       "A"       "A"       "A"
  8        "A"       "D"       "B"       "C"       "C"
  9        "D"       "B"       "A"       "A"       "D"
 10        "A"       "D"       "A"       "B"       "B"

The estimation method is straightforward to explain in the context of an example. Consider the test data in Table 2. There are 10 test cases, 4 categories, 4 raters, and the system's proposed answers. When referring to ground truth labels the four categories are written A, B, C, D; when referring to rater and system assignments they are written "A", "B", "C", "D". As described below, the estimation method is composed of four basic steps.

Estimate rater accuracy

Given that each rater has an identical confusion matrix with the structure shown in Table 1, the probability that two raters will agree on any one case is

P_a = P_c^2 + \frac{(1 - P_c)^2}{N - 1}

Here P_a is the probability of agreement, P_c is the probability that a rater will make the correct assignment, and N is the number of categories. (Two raters agree either when both select the correct category, with probability P_c^2, or when both select the same one of the N - 1 incorrect categories, with probability (N - 1)\left(\frac{1 - P_c}{N - 1}\right)^2 = \frac{(1 - P_c)^2}{N - 1}.)

Solving for P_c and taking the above-chance root yields

P_c = \frac{1 + \sqrt{(N P_a - 1)(N - 1)}}{N} \quad \text{(Eq. 1)}

Eq. 1 is used to estimate rater accuracy. In the 10 cases in Table 2 there was 33% agreement (20 agreeing pairs out of 60). Setting P_a to 1/3 and solving for P_c yields P_c = 0.5, which is the estimate of rater accuracy.
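The following sketch (hypothetical Python, not from the paper) reproduces this step on the Table 2 data: it counts agreeing rater pairs to obtain P_a and then applies Eq. 1.

```python
from itertools import combinations
from math import sqrt

# Rater assignments from Table 2: one row of four ratings per test case.
ratings = [
    ["C", "D", "C", "C"], ["B", "D", "C", "C"], ["C", "C", "D", "C"],
    ["B", "B", "D", "D"], ["A", "B", "B", "B"], ["C", "B", "D", "A"],
    ["A", "A", "A", "A"], ["A", "D", "B", "C"], ["D", "B", "A", "A"],
    ["A", "D", "A", "B"],
]

def pairwise_agreement(ratings):
    """Proportion of agreeing rater pairs across all cases (Pa)."""
    agree = total = 0
    for case in ratings:
        for r1, r2 in combinations(case, 2):
            agree += r1 == r2
            total += 1
    return agree / total

def estimate_rater_accuracy(pa, n):
    """Eq. 1: solve Pa = Pc^2 + (1-Pc)^2/(N-1) for Pc, taking the
    above-chance root. Requires Pa >= 1/N (at least chance agreement)."""
    return (1 + sqrt((n * pa - 1) * (n - 1))) / n

pa = pairwise_agreement(ratings)        # 20/60 = 0.333...
pc = estimate_rater_accuracy(pa, n=4)   # 0.5
print(f"Pa = {pa:.3f}, estimated rater accuracy Pc = {pc:.3f}")
```

Running this reproduces the worked example: 20 of 60 pairs agree, so P_a = 1/3 and the estimated rater accuracy is P_c = 0.5.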